Imputing data for temporal data store joins

ABSTRACT

A request may be received to join one or more attributes of at least two independent sets of data into a data structure. The one or more attributes may include a time attribute. The two independent sets of data may be included within a data store. It may be determined that there are one or more null values associated with the join to the data structure. In response to at least the determining that there are one or more null values associated with the join, one or more values may be imputed into one or more fields corresponding to the one or more null values, wherein there are no null values in the one or more fields subsequent to the imputing.

BACKGROUND

This disclosure relates generally to data management systems, and more specifically, to performing temporal join operations to process a data store query.

Database joins are arguably the most important relational operators because join processing may be expensive to compute, yet efficient joins are essential for the overall efficiency of a query processor. A join operation combines one or more attributes (e.g., columns) of data from one or more data structures (e.g., tables) in a data store. Consequently, join operations allow a user to analyze independent or different sets of data at one time or in a single view. For example, in a relational database, a user may issue a Structured Query Language (SQL) query that performs an “inner join” between two tables to join the two tables' matching data. In this example, an inner join selects each record from both tables where the join condition is met. That is, each record of the combined table has corresponding values such that there are no null data values. In temporal data stores, join operations involving time carry great significance. A temporal data store refers to any data store where some form of time is an attribute that is included in one or more data structures of the data store.

SUMMARY

One or more embodiments are directed to a computer-implemented method, a system, and a computer program product for performing a join operation in a data store. A first set of values may be received. The first set of values may be sampled at a first time series interval. The first set of values may be populated into a first data structure. A second set of values may be received. The second set of values may be sampled at a second time series interval. The second set of values may be populated into a second data structure. A join request may be received to join one or more attributes of the first data structure and one or more attributes of the second data structure. The request may include combining the one or more attributes of the second data structure with the first time interval. It may be determined that there are one or more null results associated with the request. A prediction estimate of what set of values would be represented by the one or more null results had the one or more attributes of the second data structure been sampled at the first time interval may be generated in response to the determining.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a computing environment, according to embodiments.

FIG. 2 illustrates a stream computing infrastructure that may be configured to execute a stream computing application, according to embodiments.

FIG. 3 is a relational database table diagram of a temporal left join operation, according to embodiments.

FIG. 4A is a time series diagram illustrating how data may be estimated and imputed, according to embodiments.

FIG. 4B is a time series diagram illustrating how data may be estimated and imputed, according to embodiments.

FIG. 5 is a flow diagram of an example process for imputing data as part of a join operation, according to embodiments.

FIG. 6 is a block diagram of a computing device that includes an impute engine, according to embodiments.

While the invention is amenable to various modifications and alternative forms, specifics thereof have been shown by way of example in the drawings and will be described in detail. It should be understood, however, that the intention is not to limit the invention to the particular embodiments described. On the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

DETAILED DESCRIPTION

Aspects of the present disclosure relate to performing particular join operations to process a query, such as in a temporal relational database, graph database, and/or a data stream management system. While the present disclosure is not necessarily limited to such applications, various aspects of the disclosure may be appreciated through a discussion of various examples using this context.

Time is relevant in most real-world phenomena. For example, it may be important to determine how long an employee worked under a particular title at work. Temporal data stores may thus specify each time interval an employee worked under a particular title. Any form of time may be measured. For example, temporal databases may include valid time, transaction time, and/or bitemporal data. Valid time is the time at which some fact becomes true or changes in the real world (e.g., the time at which an employee gets promoted). Transaction time refers to the update operation (e.g., a DELETE, INSERT, etc.) time in the database itself, as opposed to the real world. Bitemporal data combines both transaction time and valid time. A “temporal join” refers to combining one or more time-related attributes of data from one or more data structures in a data store. A “data structure” may refer to any logical object or format for organizing and/or storing data. For example, a data structure may be or include a table, node (e.g., a graph node of a graph database), column, array, file, record, and/or hash table.

Performing temporal joins in temporal data stores may be problematic where data is only sampled or obtained at particular time intervals (as opposed to continuously), or different sets of data are sampled at different time intervals. These problems may arise in applications involving correlation analysis, forecasting, real-time sensor analysis, etc. In an illustrative example, a first sensor (e.g., a pulse oximeter) may measure data every 1 minute and its values may be transmitted from the first sensor to a database. A second sensor, however, may measure a second set of data (e.g., the outside temperature) every 5 minutes and transmit the data to the same database. A user may nonetheless desire to join data associated with these unaligned time series values. If the user, for example, requests a join (e.g., a left join) to view all the values associated with the time stamp of the first sensor (i.e., data accumulated every one minute) with values associated with the time stamp of the second set of data (i.e., data accumulated every five minutes), there may be various null results returned in the query because measurement of the two sets of data occurred at different points in time. The problem with this is that even if data is not measured at all points in time (i.e., there are missing or null values), it does not follow that the data did not itself exist at those points in time where the data was not measured. Accordingly, various embodiments of the present disclosure populate values that would otherwise be null with predictive values using one or more algorithms (e.g., interpolation, last known value, etc.), as described in more detail below.
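The misalignment described above can be illustrated with a minimal sketch. It assumes two hypothetical in-memory series (the names `pulse_ox` and `outside_temp`, the start time, and all readings are invented for illustration) and performs a left join on exact timestamps, so that every one-minute row without a matching five-minute reading receives a null.

```python
from datetime import datetime, timedelta

def sample(start, step_minutes, count, value_fn):
    """Build a {timestamp: value} series sampled every step_minutes."""
    return {start + timedelta(minutes=step_minutes * i): value_fn(i)
            for i in range(count)}

start = datetime(2016, 7, 9, 8, 0)  # hypothetical start time

# Hypothetical readings: sensor 1 every minute, sensor 2 every five minutes.
pulse_ox = sample(start, 1, 11, lambda i: 97 + i % 2)    # 8:00 .. 8:10
outside_temp = sample(start, 5, 3, lambda i: 80.0 + i)   # 8:00, 8:05, 8:10

# Left join on timestamp: keep every pulse_ox row; None marks a null result.
joined = [(ts, spo2, outside_temp.get(ts))
          for ts, spo2 in sorted(pulse_ox.items())]

null_count = sum(1 for _, _, temp in joined if temp is None)
print(f"{len(joined)} joined rows, {null_count} with a null temperature")
```

Under these assumed sampling intervals, only the rows at 8:00, 8:05, and 8:10 find a temperature reading; the remaining rows are the null results that the embodiments would impute.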

FIG. 1 is a block diagram of a computing environment 100, according to embodiments. The computing environment 100 may include one or more compute nodes, such as the compute node 106, which is communicatively coupled to the user device 102 via the network 108. In some embodiments, the computing environment 100 may be implemented within a cloud computing environment, or use one or more cloud computing services. Consistent with various embodiments, a cloud computing environment may include a network-based, distributed data processing system that provides one or more cloud computing services. Further, a cloud computing environment may include many computers, hundreds or thousands of them or more, disposed within one or more data centers and configured to share resources over the network 108.

Consistent with some embodiments, the compute node 106 and/or the user device 102 may be configured the same as or analogous to the computing device 600, as illustrated in FIG. 6. In some computing environments, more or fewer components (e.g., compute nodes) may be present than illustrated in FIG. 1. In various embodiments, the compute node 106 represents a server computing device(s) and/or a particular compute instance of a single computing device (e.g., computing components within a chassis, a blade server within a blade enclosure, an I/O drawer, a processor chip, etc.). The user device 102 may be any suitable device that transmits data to the compute node 106. For example, the user device 102 may be or include a sensor(s), a desktop, laptop, handheld device (e.g., a mobile phone), etc.

The user device 102 may communicate with the compute node 106 via any suitable network 108. For example, the network 108 may be a local area network (LAN), a general wide area network (WAN), and/or a public network (e.g., the Internet).

In various embodiments, the database 112 is any suitable database type. For example, the database 112 may be or include relational databases, multidimensional or online analytical processing (OLAP) databases, online transactional processing (OLTP) databases, graph databases, real-time databases, column-oriented databases, data warehouses, operational databases, or any other suitable data store type. In some embodiments, instead of the database 112, a data stream management system may be implemented within the computing environment 100. Any one of these databases and/or data stream management systems may be considered a temporal data store for purposes of this disclosure if and only if it includes a time-related attribute(s).

FIG. 1 illustrates how temporal joins may be performed in the computing environment 100. For example, a user may issue a query on the user device 102. The query may be transmitted to the compute node 106 via the network 108. The database manager 120, and specifically the query processor 122, may receive and run the query. The query processor 122 may perform various steps to process the query, such as optimizing the query in order to choose the most efficient query execution plan (or simply choosing a plan) and executing the chosen plan. The optimizer (not shown) may include the impute engine 126, which may impute data for temporal joins within the database 112, as described in more detail below.

Stream-based computing and stream-based database computing are emerging as a developing technology for a variety of applications. For example, products are available which allow users to create applications that process and query streaming data before it reaches a database file. With this emerging technology, users can specify processing logic to apply to inbound data records while they are “in flight,” with the results available in a very short amount of time, often in fractions of a second. Constructing an application using this type of processing has opened up a new programming paradigm that will allow for development of a broad variety of innovative applications, systems, and processes, as well as present new challenges for application programmers and database developers.

In a stream computing application, stream operators are connected to one another such that data flows from one stream operator to the next (e.g., over a TCP/IP socket). When a stream operator receives data, it may perform operations, such as analysis logic, which may change the tuple (further defined herein) by adding or subtracting attributes, or updating the values of existing attributes within the tuple. When the analysis logic is complete, a new tuple is then sent to the next stream operator. Scalability is achieved by distributing an application across nodes by creating executables (i.e., processing elements), as well as replicating processing elements on multiple nodes and load balancing among them. Stream operators in a stream computing application can be fused together to form a processing element that is executable. Doing so allows processing elements to share a common process space, resulting in much faster communication between stream operators than is available using some inter-process communication techniques. Further, processing elements can be inserted or removed dynamically from an operator graph representing the flow of data through the stream computing application. A particular stream operator may not reside within the same operating system process as other stream operators. Stream operators in the same operator graph may be hosted on different nodes, e.g., on different compute nodes or on different cores of a compute node.

Data flows from one stream operator to another in the form of a “tuple.” A tuple is a sequence or row of one or more attribute values associated with an entity. Attributes may be any of a variety of different types, e.g., integer, float, Boolean, string, etc. The attributes may be ordered. In addition to attributes associated with an entity, a tuple may include metadata, i.e., data about the tuple. A tuple may be extended by adding one or more additional attributes or metadata to it. As used herein, “stream” or “data stream” refers to a sequence of tuples. Generally, a stream may be considered a pseudo-infinite sequence of tuples.

Tuples are received and output by stream operators and processing elements. An input tuple corresponding with a particular entity that is received by a stream operator or processing element, however, is generally not considered to be the same tuple that is output by the stream operator or processing element, even if the output tuple corresponds with the same entity or data as the input tuple. An output tuple need not be changed from the input tuple.

Nonetheless, an output tuple may be changed in some way by a stream operator or processing element. An attribute or metadata may be added, deleted, or modified. For example, a tuple will often have two or more attributes. A stream operator or processing element may receive the tuple having multiple attributes and output a tuple corresponding with the input tuple. The stream operator or processing element may only change one of the attributes so that all of the attributes of the output tuple except one are the same as the attributes of the input tuple.
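The single-attribute change described above may be sketched as follows. This is an illustrative assumption only: tuples are modeled as Python dictionaries, and the operator name `round_temperature` and the attribute names are hypothetical rather than part of any particular stream computing product.

```python
from typing import Dict, Iterable, Iterator

def round_temperature(stream: Iterable[Dict]) -> Iterator[Dict]:
    """Hypothetical stream operator: emits a new tuple per input tuple,
    changing only the 'temperature' attribute."""
    for tup in stream:
        out = dict(tup)                       # a distinct output tuple
        out["temperature"] = round(tup["temperature"])
        yield out

# Hypothetical input stream of two tuples with three attributes each.
source = [
    {"sensor_id": "s1", "timestamp": "08:00", "temperature": 80.4},
    {"sensor_id": "s1", "timestamp": "09:00", "temperature": 83.6},
]
outputs = list(round_temperature(source))
print(outputs)
```

Each output tuple corresponds with its input tuple and differs from it in exactly one attribute, while the input tuple itself is left unmodified.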

Generally, a particular tuple output by a stream operator or processing element may not be considered to be the same tuple as a corresponding input tuple even if the input tuple is not changed by the processing element. However, to simplify the present description and the claims, an output tuple that has the same data attributes or is associated with the same entity as a corresponding input tuple will be referred to herein as the same tuple unless the context or an express statement indicates otherwise.

Stream computing applications handle massive volumes of data that need to be processed efficiently and in real time. For example, a stream computing application may continuously ingest and analyze hundreds of thousands of messages per second and up to petabytes of data per day. Accordingly, each stream operator in a stream computing application may process a received tuple within fractions of a second. Unless the stream operators are located in the same processing element, an inter-process communication path can be used each time a tuple is sent from one stream operator to another. Inter-process communication paths can be a resource in a stream computing application.

An operator graph can be an execution path for a plurality of stream operators to process a stream of tuples. In addition to stream operators, the operator graph can refer to an execution path for processing elements and the dependent stream operators of the processing elements to process the stream of tuples. Generally, the operator graph can have a plurality of stream operators that produce a particular end result, e.g., calculate an average. An operator graph may be a linear arrangement of processing elements and/or operators, or it may include one or more distinct execution paths, also known as sub-processes, methods, or branches.

FIG. 2 illustrates a stream computing infrastructure 200 that may be configured to execute a stream computing application, according to embodiments. The stream computing infrastructure 200 includes a management system 205 and two or more compute nodes 210A-210D (i.e., hosts), which are communicatively coupled to each other using one or more communications networks 220. The management system 205 can include an operator graph 232, a stream manager 234, and an impute engine 238. The communications network 220 may include one or more servers, networks, or databases, and may use a particular communication protocol to transfer data between the compute nodes 210A-210D. A development system 202 may be communicatively coupled with the management system 205 and the compute nodes 210 either directly or via the communications network 220. In some embodiments, the stream computing infrastructure 200 is an entirely server-side environment for processing and analyzing tuples (e.g., a data stream management system). Therefore, for example, user devices (e.g., mobile phones) in some embodiments may not affect or perform any of the processes as described herein. Accordingly, two or more operators that are processing tuples may be included within the same server system and not within any client devices.

The communications network 220 may include a variety of types of physical communication channels or “links.” The links may be wired, wireless, optical, or any other suitable media. In addition, the communications network 220 may include a variety of network hardware and software for performing routing, switching, and other functions, such as routers, switches, or bridges. The communications network 220 may be dedicated for use by a stream computing application or shared with other applications and users. The communications network 220 may be any size. For example, the communications network 220 may include a single local area network or a wide area network spanning a large geographical area, such as the Internet. The links may provide different levels of bandwidth or capacity to transfer data at a particular rate. The bandwidth that a particular link provides may vary depending on a variety of factors, including the type of communication media and whether particular network hardware or software is functioning correctly or at full capacity. In addition, the bandwidth that a particular link provides to a stream computing application may vary if the link is shared with other applications and users. The available bandwidth may vary depending on the load placed on the link by the other applications and users. The bandwidth that a particular link provides may also vary depending on a temporal factor, such as time of day, day of week, day of month, or season.

Any of the compute nodes 210 may be configured the same as or analogous to the compute node 106 and/or the computing device 600 as illustrated in FIG. 6. The compute nodes 210 may include one or more stream operators. A stream computing application may include one or more stream operators that may be compiled into a “processing element” container. Two or more processing elements may run on the same memory, each processing element having one or more stream operators. Each stream operator may include a portion of code that processes tuples flowing into a processing element and outputs tuples to other stream operators in the same processing element, in other processing elements, or in both the same and other processing elements in a stream computing application. Processing elements may pass tuples to other processing elements that are on the same compute node 210 or on other compute nodes that are accessible via the communications network 220. For example, a processing element on compute node 210A may output tuples to a processing element on compute node 210B.

The tuple received by a particular processing element is generally not considered to be the same tuple that is output downstream. Typically, the output tuple is changed in some way. An attribute or metadata may be added, deleted, or changed. However, it is not required that the output tuple be changed in some way. Generally, a particular tuple output by a processing element may not be considered to be the same tuple as a corresponding input tuple even if the input tuple is not changed by the processing element. However, to simplify the present description and the claims, an output tuple that has the same data attributes as a corresponding input tuple may be referred to herein as the same tuple.

The management system 205 may be configured the same as or analogous to the compute nodes 106, 210, and/or the computing device 600. The management system 205 may include the operator graph 232, the stream manager 234, and the impute engine 238. The operator graph 232 may define how tuples are routed to processing elements for processing. Because a processing element may be a collection of fused stream operators, it is equally correct to describe the operator graph 232 as one or more execution paths between specific stream operators, which may include execution paths to different stream operators within the same processing element.

An illustrative operator graph 232 for a stream computing application may begin from one or more sources through to one or more sinks, according to some embodiments. This flow from source to sink may also be generally referred to herein as an execution path. In addition, a flow from one processing element to another may be referred to as an execution path in various contexts. The operator graph 232 may include data flows between stream operators within the same or different processing elements. Typically, processing elements receive tuples from the stream as well as output tuples into the stream (except for a sink, where the stream terminates, or a source, where the stream begins). While the operator graph 232 includes a relatively small number of components, an operator graph may be much more complex and may include many individual operator graphs that may be statically or dynamically linked together.

The stream manager 234 of FIG. 2 may be configured to monitor a stream computing application running on compute nodes, e.g., compute nodes 210A-210D, and change the deployment of an operator graph, e.g., operator graph 232. The stream manager 234 may move processing elements from one compute node 210 to another, for example, to manage the processing loads of the compute nodes 210A-210D in the stream computing infrastructure 200. Further, the stream manager 234 may control the stream computing application by inserting, removing, fusing, un-fusing, or otherwise modifying the processing elements and stream operators (or what tuples flow to the processing elements) on the compute nodes 210A-210D.

In embodiments, the impute engine 238 imputes values in response to or as a part of stream joins. Stream joins relate or link information from different streams. For example, a first stream of tuples may flow from compute node 210D to compute node 210C and a second stream of tuples may flow from compute node 210A to compute node 210B. The impute engine 238 may join the first and second streams of data for user analysis, such as correlation analysis, pattern identification, etc. In some embodiments, the impute engine 238 is included in the stream manager 234. In an illustrative example, a first stream of tuples may be received to be processed by a plurality of processing elements operating on one or more computer processors, each processing element having one or more stream operators. The plurality of processing elements may form an operator graph in which the tuples flow between the one or more stream operators. The operator graph may define one or more execution paths for processing and routing the stream of tuples. Each processing element may have an associated memory space. A second stream of tuples may also be received. A request may be received to join the first stream with the second stream. The first stream may be sampled at a first time series interval. The second stream may be sampled at a second time series interval. It may be determined that there are one or more null results associated with the request. In response to the determining, a prediction estimate may be generated of what set of values would be represented by the one or more null results had the first stream been sampled at the second time series interval. The prediction estimate is described in more detail below.

Streams may be joined in any suitable manner, such as count-based joining (e.g., joining every N^(th) tuple), attribute-based joining (e.g., joining attributes X, Y from different streams), and window-based joining (e.g., joining particular tuples at X timestamp boundaries). The functions of the impute engine 238 are described in more detail below, such as in FIG. 5.

FIG. 3 is a relational database table diagram of a temporal left join operation, according to embodiments. Tables A and B (e.g., 302A) may be data structures that hold discrete sets of data in a relational database file. Table A may represent a unique set of data that is sampled or populated at a different time interval when compared to Table B. Table A includes three columns: the “timestamp ID” column (which is the primary key of table A), the timestamp column, and the stock price column. There are four records or rows corresponding to timestamp IDs 1-4. The “timestamp” column indicates the date and clock time at which data was sampled or populated in the corresponding table. For example, for timestamp ID 1, on 7/9/2016 at 8:00, the stock price was 150. It is understood by one of ordinary skill in the art that Tables A and B are illustrative or representative only. Accordingly, for example, there may be more or fewer records and different columns, etc. Table A illustrates that the stock price is sampled, obtained, or populated every 30 minutes starting at 8:00 until 9:30.

Table B includes four columns: a “temperature ID” column (which is the primary key of table B), a “temperature” column, a “timestamp ID” column (which is the foreign key in table B), and a “timestamp” column. FIG. 3 illustrates three different states of Table B at three different times (302A, 302B, and 302C), in accordance with a left join table request. Table 302A may represent the earliest time period. Table 302B may represent an intermediate time period. And table 302C may represent table B at a latest time period. For example, at a first time, table 302A may represent data as it appears before a left join request. Therefore, before a left join request, the temperature may be sampled at every hour starting from 8:00 until 11:00, and temperature values may be populated accordingly. For example, a temperature sensor may be located in an environment and may be configured to measure the ambient temperature of the environment every hour. The sensor may be coupled to a radio such that it may transmit the temperature measurement value to a database that includes the table B, whereby a database management system may populate table B with the corresponding values accordingly. Thus, for example, on 7/9/2016 at 8:00, the temperature may be sampled at 80 degrees and populated in table 302A accordingly.

At a second time, a request to perform a left join of table A and table B in its 302A state may be issued. A left join selects each and every record from table A, along with records of table B for which a join condition is met (if at all), including any null results. For example, a user may issue a SQL left join query request, such as:

SELECT: *

FROM: Table A

LEFT JOIN: Table B

ON: Table A timestamp=Table B timestamp

WHERE: results are null.

Accordingly, each attribute of table A may be selected (timestamp and stock price) and left joined with table B's temperature data and equi-joined at the same timestamp interval as table A where the results are NULL in table 302B. As disclosed herein, the terms “null,” “null field,” “null record,” or “null results” indicate that value(s) are absent, missing, unknown, or do not exist compared to values for the same attribute. Therefore, subsequent to the left join request, table B may be populated or organized according to the table 302B. Because data is sampled in table B at every hour and not every half hour such as in table A, there are two NULL fields/records for the second and fourth record of the “temperature” column in table 302B. Accordingly, at 8:30 it is unknown what the temperature is because the temperature was not sampled or populated at that time. Likewise, at 9:30 it is unknown what the temperature is because the temperature was not sampled or populated at that time.
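The NULL results discussed above can be reproduced with a minimal, runnable sketch using Python's built-in `sqlite3` module. Several details are assumptions made only for illustration: the table and column names, every stock price other than the 150 at 8:00, and every temperature other than the 80 at 8:00 are hypothetical, and the foreign-key “timestamp ID” column of table B is omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_a (timestamp_id INTEGER PRIMARY KEY,
                      timestamp TEXT, stock_price REAL);
CREATE TABLE table_b (temperature_id INTEGER PRIMARY KEY,
                      temperature REAL, timestamp TEXT);
""")

# Table A: sampled every 30 minutes (only the 8:00 price comes from the text).
conn.executemany("INSERT INTO table_a VALUES (?, ?, ?)", [
    (1, "2016-07-09 08:00", 150.0),
    (2, "2016-07-09 08:30", 151.0),   # hypothetical
    (3, "2016-07-09 09:00", 149.5),   # hypothetical
    (4, "2016-07-09 09:30", 152.0),   # hypothetical
])

# Table B: sampled every hour (only the 8:00 temperature comes from the text).
conn.executemany("INSERT INTO table_b VALUES (?, ?, ?)", [
    (1, 80.0, "2016-07-09 08:00"),
    (2, 83.0, "2016-07-09 09:00"),    # hypothetical
    (3, 88.0, "2016-07-09 10:00"),    # hypothetical
    (4, 90.0, "2016-07-09 11:00"),    # hypothetical
])

rows = conn.execute("""
    SELECT a.timestamp, a.stock_price, b.temperature
    FROM table_a AS a
    LEFT JOIN table_b AS b ON a.timestamp = b.timestamp
    ORDER BY a.timestamp
""").fetchall()

# Timestamps whose temperature is NULL after the left join.
null_timestamps = [ts for ts, _, temp in rows if temp is None]
print(null_timestamps)
```

In this sketch, the half-hour rows at 8:30 and 9:30 come back with a NULL temperature, which corresponds to the second and fourth records described for table 302B.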

At a third subsequent time, the NULL results (e.g., fields/records) may be imputed with values according to table 302C. To “impute” may refer to inserting value(s) to stand in for missing or null data. Table 302C indicates that for the timestamp of 8:30, the temperature may be inferred to be 81.5 and this value may be imputed accordingly. And at 9:30, the temperature may be inferred to be 85.5 and imputed accordingly. The various algorithms and techniques for estimating and imputing values are described in more detail below.

At a fourth subsequent time, the “left join table” may be presented or displayed to a user as the completion of the left join request. Accordingly, the database manager may perform a left join operation of table A and the 302B version of table B to arrive at the “left join table.” Therefore, for every half hour time series interval sampling (8:00, 8:30, 9:00, 9:30) according to table A, readings for both the stock price and temperature may be populated in the left join table. The imputed values of 81.5 and 85.5 may be included in the left join table. This is different than typical left join operations, which may present the left join table with the null values still indicated. However, just because data is missing or null does not mean that it does not or should not exist. Data may be advantageously imputed for various data analyses, such as mean, median, variance, standard deviation, etc. across an entire time series spectrum to obtain more accurate analyses. For example, the variance of the temperature values may be computed for each of the half hour time intervals to initiate more complex analyses of the data.
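As a brief illustration of the analyses mentioned above, the following sketch computes summary statistics over a fully populated temperature column. The values 80, 81.5, and 85.5 come from the discussion of FIG. 3; the 9:00 reading of 83 is an assumption chosen only so that the column is complete, and the variable names are hypothetical.

```python
import statistics

# Temperature column of the left join table at 8:00, 8:30, 9:00, 9:30.
# 81.5 and 85.5 are the imputed values; 83.0 at 9:00 is an assumed reading.
imputed_temps = [80.0, 81.5, 83.0, 85.5]

mean_temp = statistics.mean(imputed_temps)
median_temp = statistics.median(imputed_temps)
variance_temp = statistics.variance(imputed_temps)   # sample variance
stdev_temp = statistics.stdev(imputed_temps)

print(mean_temp, median_temp, variance_temp, stdev_temp)
```

With the null fields left in place, the same statistics could only be computed over two of the four half-hour points.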

FIG. 4A is a time series diagram illustrating how data may be estimated and imputed, according to embodiments. FIG. 4A illustrates that data may be imputed according to last-known-value methods. Time series 1 represents a first time interval of a first data structure and time series 2 represents a different time interval of a second data structure. Points 401, 403, 405, 407, and 409 all represent different points in time (e.g., timestamps) along the time series. As illustrated in FIG. 3 above, in some instances, time may be measured at different intervals for data included in the same data store. In order to align the data corresponding to different time series, various estimation and imputation methods may be performed.

According to FIG. 4A, if there are null results for data obtained at a particular interval, the last value corresponding to an earlier time series point may be imputed in the null field. In an illustrative example, time series 1 may represent that data is sampled every half hour. Point 401 may represent 8:00, point 405 may represent 8:30, and point 407 may represent 9:00. Time series 2 may represent that data is sampled every hour (as opposed to every half hour). Point 403 may represent 8:00, and point 409 may represent 9:00. Upon a join request associated with the two different time series, data may be imputed for point 405 (8:30) since this point does not exist for the time series 2 interval. The imputation method may insert values measured or populated at the points 401 and 403 (8:00, the “last known value”). This imputation method may include identifying and imputing a value that was populated at a most recent time period. For example, referring back to FIG. 3, instead of the value of 81.5 being imputed within the second record and temperature column for the table 302C, the value of 80 may be imputed, as it was the last known value: at 8:00, the temperature was 80 according to the first record of table 302C.
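The last-known-value method may be sketched as follows. This is a minimal illustration rather than a definitive implementation: the function name `last_known_value` is hypothetical, timestamps are expressed as minutes since midnight for simplicity, and the 9:00 temperature of 83 is an assumed value.

```python
from bisect import bisect_right

def last_known_value(sample_times, sample_values, query_time):
    """Return the value sampled at the most recent time <= query_time,
    or None if nothing was sampled yet. sample_times must be sorted."""
    idx = bisect_right(sample_times, query_time) - 1
    return sample_values[idx] if idx >= 0 else None

# Time series 2: hourly samples (minutes since midnight: 8:00 and 9:00).
hourly_times = [480, 540]
hourly_temps = [80.0, 83.0]        # 83.0 is an assumed reading

# Time series 1: half-hour points 8:00, 8:30, 9:00.
half_hour_times = [480, 510, 540]

imputed = [last_known_value(hourly_times, hourly_temps, t)
           for t in half_hour_times]
print(imputed)
```

For the 8:30 point, which has no sample in the hourly series, the sketch carries forward the 8:00 value of 80, matching the example given for table 302C.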

FIG. 4B is also a time series diagram illustrating how data may be estimated and imputed, according to embodiments. FIG. 4B illustrates that instead of imputing a value according to values already specified at a particular point in time (e.g., FIG. 4A), a separate value may be estimated and imputed based on various calculations. Time series 1 and time series 2 of FIG. 4B represent two different time series intervals. For example, points 411, 415, and 419 may represent timestamps populated in a data structure every 30 minutes, such as 8:00, 8:30, and 9:00. Points 413 and 421 initially represent data gathered at a longer time interval compared to time series 1, such as every hour (e.g., point 413=8:00; point 421=9:00). Point 417 may correspond to an imputed value at a time series point equal to point 415, such as 8:30.

Various algorithms may be utilized to impute values associated with point 417. For example, interpolation algorithms such as linear spline interpolation and/or cubic spline interpolation may be utilized. Linear interpolation fits a linear function between each pair of adjacent data points. A spline is a polynomial between each pair of tabulated points (e.g., the temperature column values of table 302A in FIG. 3) given a tabulated function $f_{k} = f(x_{k})$, $k = 0, \ldots, N$. In each interval $(x_{k}, x_{k+1})$, a straight line can be fit through the tabulated points $(x_{k}, f_{k})$ and $(x_{k+1}, f_{k+1})$ using the interpolation formula:

$$f = A f_{k} + B f_{k+1} \qquad \text{(Equation 1)}$$

where

$$A \equiv \frac{x_{k+1} - x}{x_{k+1} - x_{k}}, \qquad B \equiv 1 - A = \frac{x - x_{k}}{x_{k+1} - x_{k}}. \qquad \text{(Equation 2)}$$
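As a worked illustration of Equations 1 and 2, the following Python sketch evaluates the straight-line interpolant at 8:30 using the 8:00 and 9:00 temperature readings of 80 and 83 mentioned in connection with FIG. 3. The function name `linear_interpolate` and the minutes-after-midnight time encoding are assumptions introduced for the example.

```python
def linear_interpolate(x_k, f_k, x_k1, f_k1, x):
    """Evaluate Equation 1, f = A*f_k + B*f_(k+1), with A and B from Equation 2."""
    a = (x_k1 - x) / (x_k1 - x_k)  # weight of the left tabulated point
    b = 1.0 - a                    # weight of the right tabulated point
    return a * f_k + b * f_k1


# Hypothetical encoding: minutes after midnight (480 = 8:00, 510 = 8:30, 540 = 9:00).
print(linear_interpolate(480, 80, 540, 83, 510))  # 81.5
```

With the query point midway between the two readings, A = B = 0.5, so the result is the midpoint 81.5, which agrees with the imputed 8:30 value discussed for FIG. 3.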

Cubic spline interpolation produces an interpolated function that is continuous through to a second derivative. The derivatives may correspond to the rate of change between values (e.g., the temperature values in table 302A). In addition to the tabulated values of $f_{i}$, there are also tabulated values for the function's second derivatives (i.e., a set of numbers $f_{i}''$). Then within each interval $(x_{k}, x_{k+1})$, a cubic polynomial can be added to the right-hand side of Equation 1 whose second derivative varies linearly from a value $f_{k}''$ on the left to a value $f_{k+1}''$ on the right, which yields a continuous second derivative. Equation 1 may be replaced by:

$$f = A f_{k} + B f_{k+1} + C f_{k}'' + D f_{k+1}'' \qquad \text{(Equation 3)}$$

where A and B are defined as in Equation 2 and

$$C \equiv \frac{1}{6}\left(A^{3} - A\right)\left(x_{k+1} - x_{k}\right)^{2}, \qquad D \equiv \frac{1}{6}\left(B^{3} - B\right)\left(x_{k+1} - x_{k}\right)^{2}. \qquad \text{(Equation 4)}$$

The derivatives of Equation 3 with respect to x may be taken using the definitions of A, B, C, and D to compute dA/dx, dB/dx, dC/dx, and dD/dx. The result is:

$$\frac{df}{dx} = \frac{f_{k+1} - f_{k}}{x_{k+1} - x_{k}} - \frac{3A^{2} - 1}{6}\left(x_{k+1} - x_{k}\right) f_{k}'' + \frac{3B^{2} - 1}{6}\left(x_{k+1} - x_{k}\right) f_{k+1}'' \qquad \text{(Equation 5)}$$

for the first derivative and

$$\frac{d^{2}f}{dx^{2}} = A f_{k}'' + B f_{k+1}'' \qquad \text{(Equation 6)}$$

for the second derivative.

The required equations for cubic spline interpolation are obtained by setting Equation 5 evaluated for $x = x_{k}$ in the interval $(x_{k-1}, x_{k})$ equal to the same equation evaluated for $x = x_{k}$ but in the interval $(x_{k}, x_{k+1})$. This gives:

$$\frac{x_{k} - x_{k-1}}{6} f_{k-1}'' + \frac{x_{k+1} - x_{k-1}}{3} f_{k}'' + \frac{x_{k+1} - x_{k}}{6} f_{k+1}'' = \frac{f_{k+1} - f_{k}}{x_{k+1} - x_{k}} - \frac{f_{k} - f_{k-1}}{x_{k} - x_{k-1}}. \qquad \text{(Equation 7)}$$

Accordingly, new value(s) associated with point 417 may be generated as an accurate prediction estimate of what the value associated with point 417 would have been had the data been measured at point 417, based on the time intervals of the time series and the actual data points, through the equations explained above. Therefore, an imputation method may be specified such that at least one imputation function is applied to one or more time series data structures.
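One way to turn Equations 3, 4, and 7 into a computation is sketched below in Python. Equation 7 supplies one linear equation per interior point, so two further conditions are needed to fix the second derivatives; this sketch assumes the common "natural" choice of zero second derivative at both end points, which the text above does not specify. The function names, the use of hours as the time unit, and the tridiagonal (Thomas) solver are likewise assumptions made for illustration. The temperatures 80, 83, and 82 at 8:00, 9:00, and 10:00 are the values mentioned in connection with FIG. 3.

```python
from bisect import bisect_right


def natural_spline_second_derivatives(x, f):
    """Solve Equation 7 for f'' at every tabulated point.

    Assumes a natural spline: f'' = 0 at the first and last points.
    """
    n = len(x)
    if n < 3:
        return [0.0] * n
    # Rows 0 and n-1 encode the assumed boundary conditions f'' = 0.
    sub, diag, sup, rhs = [0.0] * n, [1.0] * n, [0.0] * n, [0.0] * n
    for k in range(1, n - 1):  # one row of Equation 7 per interior point
        sub[k] = (x[k] - x[k - 1]) / 6.0
        diag[k] = (x[k + 1] - x[k - 1]) / 3.0
        sup[k] = (x[k + 1] - x[k]) / 6.0
        rhs[k] = ((f[k + 1] - f[k]) / (x[k + 1] - x[k])
                  - (f[k] - f[k - 1]) / (x[k] - x[k - 1]))
    # Forward elimination of the tridiagonal system.
    for k in range(1, n):
        m = sub[k] / diag[k - 1]
        diag[k] -= m * sup[k - 1]
        rhs[k] -= m * rhs[k - 1]
    # Back substitution.
    d2 = [0.0] * n
    d2[n - 1] = rhs[n - 1] / diag[n - 1]
    for k in range(n - 2, -1, -1):
        d2[k] = (rhs[k] - sup[k] * d2[k + 1]) / diag[k]
    return d2


def spline_evaluate(x, f, d2, t):
    """Evaluate Equation 3 at t, with A, B from Equation 2 and C, D from Equation 4."""
    k = min(max(bisect_right(x, t) - 1, 0), len(x) - 2)  # interval containing t
    h = x[k + 1] - x[k]
    a = (x[k + 1] - t) / h
    b = 1.0 - a
    c = (a ** 3 - a) * h * h / 6.0
    d = (b ** 3 - b) * h * h / 6.0
    return a * f[k] + b * f[k + 1] + c * d2[k] + d * d2[k + 1]


hours = [8.0, 9.0, 10.0]
temps = [80.0, 83.0, 82.0]
second_derivs = natural_spline_second_derivatives(hours, temps)
print(spline_evaluate(hours, temps, second_derivs, 8.5))  # approximately 81.875
```

Under the natural boundary assumption the 8:30 estimate is about 81.875, slightly above the linear value of 81.5, because the spline bends to pass smoothly through the 9:00 peak. A different boundary condition or additional tabulated points would give a different estimate.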

The estimations as explained above may include or instead be calculated by other methods. For example, instead of the last known value method being utilized as illustrated in FIG. 4A, other known values may be imputed, such as the next future value. Thus, with reference to FIG. 4A, instead of imputing the value associated with time point 403 for time point 405, a value may be imputed that is associated with time point 409. In some embodiments, an average or mean calculation between two or more points may be utilized to impute data. For example, referring back to FIG. 3, in order to determine what the fourth record of the temperature value should be in table 302C, an average of the 9:00 temperature value of 83 and the 10:00 temperature value of 82 may be computed, which is 82.5. Any other suitable methods may be utilized.
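The two alternatives mentioned above, next-future-value imputation and averaging of neighboring values, may be sketched as follows. The function names and the minutes-after-midnight time encoding are assumptions for illustration; the temperatures are those cited from FIG. 3.

```python
from bisect import bisect_left


def impute_next_known(sample_times, sample_values, t):
    """Return the value sampled at the earliest time at or after t (None if none exists).

    sample_times must be sorted in ascending order.
    """
    idx = bisect_left(sample_times, t)
    return sample_values[idx] if idx < len(sample_values) else None


def impute_mean(values):
    """Return the arithmetic mean of two or more known values."""
    return sum(values) / len(values)


# Hypothetical encoding: minutes after midnight (480 = 8:00, 540 = 9:00, 600 = 10:00).
times = [480, 540, 600]
temps = [80, 83, 82]

print(impute_next_known(times, temps, 510))  # 83 (the 9:00 reading fills 8:30)
print(impute_mean([83, 82]))                 # 82.5 (9:00 and 10:00 averaged for 9:30)
```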

FIG. 5 is a flow diagram of an example process 500 for imputing data as part of a join operation, according to embodiments. The process 500 begins at block 502 when a data store join request is received. For example, a user may issue a SQL join request query. Consistent with embodiments, any suitable type of join operation may be performed or requested, such as a left join, right join, inner join, full join, self join, non-equi-join, Cartesian join, nested-loop, explicit partitioning, explicit sorting, timestamp sorting, timestamp partitioning, sliding-window stream join, symmetric hash joins, double pipelined hash joins, hash merge join, and/or progressive merge join.

Per block 504, it may be determined (e.g., by the impute engine 126, 238) whether there are any null or missing values associated with the joined data structure. If there are no null results, such as in an inner join request, then the process 500 may proceed to block 510 where the joined data structure is presented. If there are any null or missing values (e.g., the table 302B of FIG. 3), then according to block 506 an estimate may be generated (e.g., by the impute engine 126, 238) of what the null or missing values should be. The estimation at block 506 may correspond to a prediction of what value a null field would be had a sampling of data occurred at a particular time interval that was not originally specified in the data structure. The estimation value may be based on patterns and associations made with the rest of the data in the data structure. For example, interpolation algorithms and last-known-value methods may be utilized to predict these values as described with respect to FIGS. 4A and 4B above.

Per block 508, data may be imputed (e.g., by the impute engine 126, 238) into the null or missing fields based on the estimation generation that occurred at block 506. For example, based on making interpolation calculations, null SQL fields may be populated or imputed with the interpolation results.

Per block 510, the joined data structure may be presented. For example, referring back to FIG. 3, the left join table with a timestamp column, a stock price column, and a temperature column may be displayed on a computing device such that a user may view the joined data structure they requested.
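The flow of blocks 502 through 510 may be sketched end to end as follows. This is an illustrative outline under stated assumptions, not the impute engine itself: tables are modeled as lists of (timestamp, value) pairs, timestamps are minutes after midnight, the stock prices are hypothetical placeholder values, and linear interpolation is used as the estimator at block 506 although any of the methods described above could be substituted.

```python
from bisect import bisect_right


def linear_estimate(times, values, t):
    """Block 506 estimator: interpolate linearly between the neighboring samples."""
    i = bisect_right(times, t)
    if i == 0:
        return values[0]       # before the first sample: reuse it
    if i == len(times):
        return values[-1]      # after the last sample: reuse it
    x0, x1 = times[i - 1], times[i]
    a = (x1 - t) / (x1 - x0)
    return a * values[i - 1] + (1.0 - a) * values[i]


def left_join_with_imputation(left_rows, right_rows, estimate):
    """Left join on timestamp, then fill null right-hand fields with estimates."""
    right_by_time = dict(right_rows)
    known_times = [t for t, _ in right_rows]
    known_values = [v for _, v in right_rows]
    # Block 502: perform the requested left join; unmatched rows get None.
    joined = [(t, lv, right_by_time.get(t)) for t, lv in left_rows]
    # Block 504: check whether the join produced any null values.
    if any(rv is None for _, _, rv in joined):
        # Blocks 506 and 508: estimate each missing value and impute it.
        joined = [
            (t, lv, rv if rv is not None else estimate(known_times, known_values, t))
            for t, lv, rv in joined
        ]
    # Block 510: the joined structure is ready to be presented.
    return joined


# Table A: half-hourly stock prices (hypothetical placeholder values).
table_a = [(480, 100.0), (510, 101.0), (540, 102.0), (570, 103.0)]
# Table B: hourly temperatures at 8:00, 9:00, and 10:00.
table_b = [(480, 80), (540, 83), (600, 82)]

for row in left_join_with_imputation(table_a, table_b, linear_estimate):
    print(row)
```

Passing the estimator as a parameter keeps the join logic independent of the imputation method, so a last-known-value or spline estimator with the same three-argument signature could be swapped in without changing the join.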

FIG. 6 is a block diagram of a computing device 600 that includes an impute engine 626, according to embodiments. The components of the computing device 600 can include one or more processors 06, a memory 12, a terminal interface 18, a storage interface 20, an Input/Output (“I/O”) device interface 22, and a network interface 24, all of which are communicatively coupled, directly or indirectly, for inter-component communication via a memory bus 10, an I/O bus 16, bus interface unit (“IF”) 08, and an I/O bus interface unit 14.

The computing device 600 may include one or more general-purpose programmable central processing units (CPUs) 06A and 06B, herein generically referred to as the processor 06. In an embodiment, the computing device 600 may contain multiple processors; however, in another embodiment, the computing device 600 may alternatively be a single CPU device. Each processor 06 executes instructions stored in the memory 12 (e.g., the impute engine 626).

The computing device 600 may include a bus interface unit 08 to handle communications among the processor 06, the memory 12, the display system 04, and the I/O bus interface unit 14. The I/O bus interface unit 14 may be coupled with the I/O bus 16 for transferring data to and from the various I/O units. The I/O bus interface unit 14 may communicate with multiple I/O interface units 18, 20, 22, and 24, which are also known as I/O processors (IOPs) or I/O adapters (IOAs), through the I/O bus 16. The display system 04 may include a display controller, a display memory, or both. The display controller may provide video, audio, or both types of data to a display device 02. The display memory may be a dedicated memory for buffering video data. The display system 04 may be coupled with a display device 02, such as a standalone display screen, computer monitor, television, a tablet or handheld device display, or any other displayable device. In an embodiment, the display device 02 may include one or more speakers for rendering audio. Alternatively, one or more speakers for rendering audio may be coupled with an I/O interface unit. In alternate embodiments, one or more functions provided by the display system 04 may be on board an integrated circuit that also includes the processor 06. In addition, one or more of the functions provided by the bus interface unit 08 may be on board an integrated circuit that also includes the processor 06.

The I/O interface units support communication with a variety of storage and I/O devices. For example, the terminal interface unit 18 supports the attachment of one or more user I/O devices, which may include user output devices (such as a video display device, speaker, and/or television set) and user input devices (such as a keyboard, mouse, keypad, touchpad, trackball, buttons, light pen, or other pointing devices). A user may manipulate the user input devices using a user interface in order to provide input data and commands to the user I/O device 26 and the computing device 600, and may receive output data via the user output devices. For example, a user interface may be presented via the user I/O device 26, such as displayed on a display device, played via a speaker, or printed via a printer.

The storage interface 20 supports the attachment of one or more disk drives or direct access storage devices 28 (which are typically rotating magnetic disk drive storage devices, although they could alternatively be other storage devices, including arrays of disk drives configured to appear as a single large storage device to a host computer, or solid-state drives, such as a flash memory). In another embodiment, the storage device 28 may be implemented via any type of secondary storage device. The contents of the memory 12, or any portion thereof, may be stored to and retrieved from the storage device 28 as needed. The storage devices 28 may be employed to store any of the databases described herein, including databases 110, 112, and 114. The I/O device interface 22 provides an interface to any of various other I/O devices or devices of other types, such as printers or fax machines. The network interface 24 provides one or more communication paths from the computing device 600 to other digital devices and computer systems.

Although the computing device 600 shown in FIG. 6 illustrates a particular bus structure providing a direct communication path among the processors 06, the memory 12, the bus interface 08, the display system 04, and the I/O bus interface unit 14, in alternative embodiments the computing device 600 may include different buses or communication paths, which may be arranged in any of various forms, such as point-to-point links in hierarchical, star or web configurations, multiple hierarchical buses, parallel and redundant paths, or any other appropriate type of configuration. Furthermore, while the I/O bus interface unit 14 and the I/O bus 16 are shown as single respective units, the computing device 600 may include multiple I/O bus interface units 14 and/or multiple I/O buses 16. While multiple I/O interface units are shown, which separate the I/O bus 16 from various communication paths running to the various I/O devices, in other embodiments, some or all of the I/O devices are connected directly to one or more system I/O buses.

In various embodiments, the computing device 600 is a multi-user mainframe computer system, a single-user system, or a server computer or similar device that has little or no direct user interface, but receives requests from other computer systems (clients). In other embodiments, the computing device 600 may be implemented as a desktop computer, portable computer, laptop or notebook computer, tablet computer, pocket computer, telephone, smart phone, or any other suitable type of electronic device. The computing device 600 may be any of the compute nodes 102, 104, and/or 106 of FIG. 1.

In an embodiment, the memory 12 may include a random-access semiconductor memory, storage device, or storage medium (either volatile or non-volatile) for storing or encoding data and programs. In another embodiment, the memory 12 represents the entire virtual memory of the computing device 600, and may also include the virtual memory of other computer systems coupled to the computing device 600 or connected via a network 30. The memory 12 may be a single monolithic entity, but in other embodiments the memory 12 may include a hierarchy of caches and other memory devices. For example, memory may exist in multiple levels of caches, and these caches may be further divided by function, so that one cache holds instructions while another holds non-instruction data, which is used by the processor. Memory 12 may be further distributed and associated with different CPUs or sets of CPUs, as is known in any of various so-called non-uniform memory access (NUMA) computer architectures.

The memory 12 may store all or a portion of the components and data (e.g., the impute engine 626) shown in FIG. 6. These programs and data are illustrated in FIG. 6 as being included within the memory 12 in the computing device 600; however, in other embodiments, some or all of them may be on different computer systems and may be accessed remotely, e.g., via a network 30. The computing device 600 may use virtual addressing mechanisms that allow the programs of the computing device 600 to behave as if they only have access to a large, single storage entity instead of access to multiple, smaller storage entities. Thus, while the components and data shown in FIG. 6 are illustrated as being included within the memory 12, these components and data are not necessarily all completely contained in the same storage device at the same time. Although the components and data shown in FIG. 6 are illustrated as being separate entities, in other embodiments some of them, portions of some of them, or all of them may be packaged together.

In some embodiments, the memory 12 may include program instructions or modules, such as the impute engine 626. The impute engine 626 may be the impute engine 126 of FIG. 1 and/or the impute engine 238 of FIG. 2. In some embodiments, the impute engine 626 performs some or each of the functions as described in FIGS. 3, 4A, 4B, and/or 5.

In an embodiment, the components and data shown in FIG. 6 (e.g., the impute engine 626) may include instructions or statements that execute on the processor 06 or instructions or statements that are interpreted by instructions or statements that execute on the processor 06 to carry out the functions as described above. In another embodiment, the components shown in FIG. 6 may be implemented in hardware via semiconductor devices, chips, logical gates, circuits, circuit cards, and/or other physical hardware devices in lieu of, or in addition to, a processor-based system. In an embodiment, the components shown in FIG. 6 may include data in addition to instructions or statements.

FIG. 6 is intended to depict representative components of the computing device 600. Individual components, however, may have greater complexity than represented in FIG. 6. In FIG. 6, components other than or in addition to those shown may be present, and the number, type, and configuration of such components may vary. Several particular examples of additional complexity or additional variations are disclosed herein; these are by way of example only and are not necessarily the only such variations. The various program components illustrated in FIG. 6 may be implemented, in various embodiments, in a number of different ways, including using various computer applications, routines, components, programs, objects, modules, data pages, etc., which may be referred to herein as “software,” “computer programs,” or simply “programs.”

Aspects of the present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the various embodiments.

The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.

Computer readable program instructions for carrying out operations of embodiments of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of embodiments of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.

These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

What is claimed is:
1. A computer-implemented method for performing a temporal join operation in a database, the method comprising:
populating, into a first table of a relational database, a first set of database records with a first set of values of a first column, the first set of database values are gathered by a first sensor at a first time series interval, the first time series interval including a plurality of points in time the first set of database values were sampled;
populating, under a first timestamp column of the first table, each of the first set of database records with a respective point of the plurality of points in time that a particular value of the first set of database values were sampled;
populating, into a second table of the relational database, a second set of database records with a second set of values of a second column, the second set of database values are gathered by a second sensor at a second time series interval, the second time series interval including a second plurality of points in time the second set of database values were sampled, wherein the first time series interval is different than the second time series interval;
populating, under a second timestamp column of the second table, each of the second set of database records with a respective point of the plurality of points in time that a particular value of the second set of database values were sampled;
receiving, subsequent to the populating of the first table and the second table, a Structured Query Language (SQL) query request to perform a temporal left join operation of the first and second columns of the first and second tables, the temporal left join operation causes the first set of values of the first column and the second set of values of the second column to be displayed together in a single view in a third table, the SQL query request specifies selecting all attributes from the first table and left joining the second column of the second table and equi-joining both the first set of values and the second set of values at the first time series interval where one or more results are null in the second table, wherein the one or more null results indicates that one or more values are missing;
determining that there are a plurality of null values for a plurality of fields in the second table for the second column;
generating, in response to the determining that there are the plurality of null values for the plurality of fields in the second table for the second column, a prediction estimate of what third set of values would be implemented in the plurality of fields for the plurality of null values had at least one of the third set of values been sampled at the first time interval, the generating of the prediction estimate includes utilizing a cubic spline interpolation estimation, the cubic spline interpolation estimation producing an interpolated function that is continuous through at least two derivatives;
imputing, in response to the generating of the prediction estimate, the third set of values into the plurality of fields, wherein there are no null values in the plurality of fields subsequent to the imputing; and
displaying, in response to the imputing the third set of values into the plurality of fields, the third table, the third table including: the first set of values under the first column, the second set of values under the second column, the third set of values under the second column, and the first time series interval.